The data contains classifications of blood samples performed by “Accellix” (“the company”). There are two types of classification, manual and automatic, both performed on the company’s instruments. The goal is to analyze the differences between the classification types in order to improve the automatic classification. There are 4 blood samples, each classified several times (the replicates), both manually and automatically, and examined with respect to the blood marker CD45+.
CD45 (CD stands for cluster of differentiation), also known as protein tyrosine phosphatase, receptor type C (PTPRC), is an enzyme that, in humans, is encoded by the PTPRC gene.
The data is loaded from an Excel file; here are the main columns and their description:
| Columns of the excel file | Description of the data |
|---|---|
| Replicate Samples | Blood sample with repetition. |
| Observation name | Name of the files with the observations from the sample. |
| %Viable CD45+/CD45+ Singlet (Manual classification) | Percentage of Viable CD45+/CD45+ Singlet from the traditional flow cytometry Manual classification by Biologist. |
| %Viable CD45+/CD45+ Singlet (Auto classification) | Percentage of Viable CD45+/CD45+ Singlet from the Automated flow cytometry Auto classification by Device. |
The variance plots help us understand the scattering around the mean of each classification and assist in finding suspected outliers.
Suspected outliers from the plot:
Variance plot, manual data vs. mean:
The scatter plot “Manual classification vs. Auto classification” shows a high correlation (0.687).
Based on the analysis above, we will mark the following observation as an outlier and remove it.
Our inference uses t-statistics, which require one of the following conditions:
From the density plots above we can see that the groups are approximately normally distributed and consist of independent classifications, so it is reasonable to assume that our data is sampled from a normal distribution.
Sometimes we would like to test our hypothesis on multiple coordinates, where each coordinate has its own meaning. For example, we can look at \[ H_0:\mu=\mu_0\in\mathbb{R}^m \] as \(m\) tests, and our \(H_0\) would be \[ H_0:\cap_{i=1}^m \{\mu_{i}=\mu_{0,i}\} \] where \(\mu_i\) is the \(i^{th}\) coordinate of the vector \(\mu\); this hypothesis is also called the global null. Using our significance level \(\alpha\) to test each hypothesis may lead to problems: even though the probability of a type I error in a single hypothesis is \(\le\alpha\), the chance of making at least one such error can be much higher when we use the same \(\alpha\) for all our partial hypotheses. For example, if we take \(\alpha=0.05,m=10\) and our hypotheses are independent coordinates (and identically distributed, e.g. each one consists of flipping a fair coin several times), then by comparing each coordinate to \(\alpha\) we have a chance of \[ 1-(1-0.05)^{10}\approx 0.4013 \] of making a type I error, which is far too big, so a different approach is needed.
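The back-of-the-envelope calculation above can be checked directly in R (using the example’s \(\alpha=0.05\) and \(m=10\)):

```r
# Probability of at least one type I error among m independent tests,
# each performed at level alpha, under the global null
alpha <- 0.05
m <- 10
1 - (1 - alpha)^m  # approximately 0.4013
```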
As learned in class there are several procedures that were designed to address the multiple comparisons challenge. We will describe and implement the following (while mentioning and proving the simplest results):
Using the criterion FWER.
Note: there is another famous criterion we learned about, FDR, but since FDR is about proportions and we have only a few partial hypotheses (4), we will not address it.
Let \(m\) be the number of hypotheses, \(m_0\) the number of hypotheses for which \(H_{0,j}\) is true, \(R\) the number of rejected hypotheses, and \(V\) the number of falsely rejected hypotheses (note that \(V\) and \(m_0\) cannot be observed). Then: \[\begin{equation} FWER=\mathbb{P}(V\ge 1) \end{equation}\]
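Since \(V\) cannot be observed in practice, a small Monte Carlo sketch can illustrate the definition; the setup here (m independent true nulls with uniform p-values, each tested at level \(\alpha\)) is hypothetical:

```r
# Estimate FWER = P(V >= 1) when each of m true nulls is tested at level alpha
set.seed(1)
m <- 10; alpha <- 0.05; B <- 1e4
# Under a true null the p-value is U[0,1]; a rejection is p <= alpha
at_least_one_false_rejection <- replicate(B, any(runif(m) <= alpha))
mean(at_least_one_false_rejection)  # close to 1 - (1 - alpha)^m
```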
The multiple comparisons procedures take a vector of p-values. We will generate 4 p-values for our 4 groups, based on paired t-tests where the null hypothesis (for each group) is that the mean difference (between auto and manual classification) is 0.
We use the paired t-test since we cannot assume independence between the manual and the auto classifications; in fact, we hope there is a strong positive correlation between the classifiers (assuming the manual classification is “correct”).
# Extracting the p-value for each sample based on a paired t-test.
alpha = 0.05
dataCopy = data
colnames(dataCopy) <- c("sample", "classname", "manual", "auto")
# Filter the observations of each of the 4 samples.
sample1 <- dataCopy %>% filter(sample == levels(factor(sample))[1])
sample2 <- dataCopy %>% filter(sample == levels(factor(sample))[2])
sample3 <- dataCopy %>% filter(sample == levels(factor(sample))[3])
sample4 <- dataCopy %>% filter(sample == levels(factor(sample))[4])
# Per-replicate differences (manual - auto) for each sample.
meanDiffSample1 = sample1$manual - sample1$auto
meanDiffSample2 = sample2$manual - sample2$auto
meanDiffSample3 = sample3$manual - sample3$auto
meanDiffSample4 = sample4$manual - sample4$auto
t1 = t.test(meanDiffSample1,conf.level = 1 - alpha)
t2 = t.test(meanDiffSample2,conf.level = 1 - alpha)
t3 = t.test(meanDiffSample3,conf.level = 1 - alpha)
t4 = t.test(meanDiffSample4,conf.level = 1 - alpha)
p = c(t1$p.value,t2$p.value,t3$p.value,t4$p.value)
The procedure is as follows: given \(m\) hypotheses and significance level \(\alpha\), test each one at significance level \(\frac{\alpha}{m}\). It is straightforward to show that the probability of falsely rejecting the global null is at most \(\alpha\): \[ \mathbb{P}_{H_0}(\cup_{j=1}^m \{p_j\le\frac{\alpha}{m}\})\le\sum_{j=1}^m \mathbb{P}_{H_0}(p_j\le\frac{\alpha}{m})\overset{p_j\overset{H_0}{\sim}U[0,1]}{=}\sum_{j=1}^m \frac{\alpha}{m}=\alpha \] There are 2 ways of controlling FWER: the strong version holds no matter which hypotheses are true, while the weak version holds only under the global null. Bonferroni correction controls FWER in the strong sense, and the proof is short: let \(v_j\) be the indicator of rejecting \(H_{0,j}\), i.e. it equals \(1\) if the hypothesis is rejected and \(0\) otherwise, thus \[ V=\sum_{j\in H_0} v_j \] hence \[ \mathbb{E}[V]=\sum_{j\in H_0} \mathbb{E}[v_j]=\sum_{j\in H_0}\mathbb{P}(v_j=1)=\sum_{j\in H_0}\mathbb{P}(p_j\le\frac{\alpha}{m})\overset{p_j\overset{H_0}{\sim}U[0,1]}{=}m_0\frac{\alpha}{m}\le\alpha \] Now, since \(V\) is a non-negative integer-valued random variable, we have \[ FWER=\mathbb{P}(V\ge 1)\le\sum_{k=1}^m \mathbb{P}(V\ge k)=\mathbb{E}[V]\le\alpha \]
bonferroni_sig <- p.adjust(p, "bonferroni") < alpha
bonferroni_sig
## [1] FALSE FALSE TRUE TRUE
The procedure does not reject the null hypothesis for groups 1 and 2.
It does reject the null hypothesis for groups 3 and 4.
As we see below, Bonferroni is the most conservative when rejecting hypotheses.
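As a sanity check, `p.adjust`’s `"bonferroni"` method is simply each p-value multiplied by \(m\) and capped at 1; a minimal sketch on hypothetical toy p-values (not our data):

```r
# Bonferroni adjustment by hand: p_adj = min(1, m * p)
p_toy <- c(0.001, 0.013, 0.04, 0.2)  # illustrative values only
m <- length(p_toy)
manual_bonf <- pmin(1, m * p_toy)
all.equal(manual_bonf, p.adjust(p_toy, "bonferroni"))  # TRUE
```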
The algorithm goes as follows: sort the p-values \(p_{(1)}\le\cdots\le p_{(m)}\), and for \(j=1,2,\ldots\) reject \(H_{0,(j)}\) as long as \(p_{(j)}\le\frac{\alpha}{m-j+1}\), stopping at the first index for which the inequality fails.
It is easy to show that Holm’s procedure controls FWER in the strong sense:
Let \(H_0\) be the set of indexes of the true \(H_{0,j}\) hypotheses.
And let \(j_0=\min(H_0)\), the smallest index (in the sorted order) of a true null hypothesis, i.e. the first hypothesis we could falsely reject.
Since there are exactly \(m-m_0\) hypotheses in \(H_0^c\), at most \(m-m_0\) of them can precede \(j_0\), so we can deduce that \(j_0\le m-m_0+1\).
We will make a type I error only if \[ p_{(j_0)}\le\frac{\alpha}{m-j_0+1}\le\frac{\alpha}{m_0} \] Thus \[ \mathbb{P}(\min_{j\in H_0} p_j\le\frac{\alpha}{m_0})\overset{union\;bound}{\le}\sum_{j\in H_0} \mathbb{P}(p_j\le\frac{\alpha}{m_0})\overset{p_j\overset{H_0}{\sim}U[0,1]}{=}m_0\frac{\alpha}{m_0}=\alpha \]
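To make the step-down rule concrete, here is a minimal sketch on hypothetical toy p-values, reproducing `p.adjust`’s `"holm"` adjustment:

```r
# Holm by hand: in sorted order, p_adj_(j) = max_{i<=j} (m - i + 1) * p_(i), capped at 1
p_toy <- c(0.001, 0.013, 0.04, 0.2)  # illustrative values only
m <- length(p_toy)
o <- order(p_toy)
adj_sorted <- pmin(1, cummax((m - seq_len(m) + 1) * p_toy[o]))
manual_holm <- adj_sorted[order(o)]  # back to the original order
all.equal(manual_holm, p.adjust(p_toy, "holm"))  # TRUE
```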
holm_sig <- p.adjust(p, "holm") < alpha
holm_sig
## [1] FALSE TRUE TRUE TRUE
The procedure does not reject the null hypothesis for group 1, but does reject it for the rest.
As can be seen, though Bonferroni and Holm both control FWER in the strong sense, Bonferroni is indeed more conservative in rejecting hypotheses.
The algorithm goes as follows: sort the p-values \(p_{(1)}\le\cdots\le p_{(m)}\) and compute the Simes statistic \[ T_m=\min_{1\le j\le m}\frac{m\,p_{(j)}}{j} \] rejecting the global null if \(T_m\le\alpha\).
Then under \(H_0\), \(T_m\sim U[0,1]\), and assuming the p-values are independent, the procedure controls FWER (the proof is a bit long and was provided in a dedicated file on Moodle).
# Simes statistic: T_m = min_j (m * p_(j) / j)
sorted_p = sort(p)
m = length(p)
T_stat = sorted_p * m / seq_len(m)
T_m = min(T_stat)
T_m <= alpha
## [1] TRUE
The Simes method tells us whether to reject the global null.
As might be expected, once Bonferroni rejects half of the hypotheses and Holm rejects 75%, the global null is rejected.
As the most conservative procedure, Bonferroni correction rejects 50% of the hypotheses, Holm rejects 75% of them and Simes rejects the global null.
The fact that Bonferroni and Holm did not reject all the hypotheses indicates that our auto classifier might be on the right track and is probably better than a random classifier.
The fact that the majority of hypotheses are rejected by Holm, and that the global null is rejected, indicates that the classifier should be further trained.
Important note: indeed, our classifier didn’t do a great job here; however, our criterion, \(H_0:\Delta=0\), is very demanding, hence a look at simultaneous confidence intervals might be insightful.
A t-test confidence interval is of the form: \[ CI^\alpha=[\bar{x}-t_{n-1,1-\frac{\alpha}{2}}\cdot\frac{S}{\sqrt{n}},\bar{x}+t_{n-1,1-\frac{\alpha}{2}}\cdot\frac{S}{\sqrt{n}}] \] where \(\alpha\) is our chosen significance level and \(S^2=\frac{1}{n-1}\sum_{i=1}^n(x_i-\bar{x})^2\). However, as in hypothesis testing, this approach can fail when we have multiple comparisons. For example, let \(T_j\sim N(\theta_j,1)\) be independent and construct \(CI_j^\alpha=[T_j-Z_{1-\frac{\alpha}{2}},T_j+Z_{1-\frac{\alpha}{2}}]\); then \[ \mathbb{P}_\theta(\cap_{j=1}^m\{\theta_j\in CI_j\})=\prod_{j=1}^m \mathbb{P}_{\theta_j}(\theta_j\in CI_j)=(1-\alpha)^m < 1-\alpha \] and we do not achieve the desired confidence. So, as before, a different approach is required.
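The \((1-\alpha)^m\) coverage loss is easy to quantify; for our own setting of four groups at \(\alpha=0.05\):

```r
# Joint coverage of m independent marginal (1 - alpha) confidence intervals
alpha <- 0.05
m <- 4
(1 - alpha)^m  # about 0.8145, well below the nominal 0.95
```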
Definition: A confidence region covers \(\theta=(\theta_1,...,\theta_m)\) simultaneously with confidence level \(1-\alpha\) if \(\forall \theta\in\mathbb{R}^m\) we have \(\mathbb{P}_\theta(\theta\in CI)\ge 1-\alpha\). We will implement on our data a simultaneous confidence interval method using the Bonferroni correction, that is: \[ CI_j^{\alpha Bon}=[\bar{x_j}-t_{n_j-1,1-\frac{\alpha}{2m}}\cdot\frac{S_j}{\sqrt{n_j}},\bar{x_j}+t_{n_j-1,1-\frac{\alpha}{2m}}\cdot\frac{S_j}{\sqrt{n_j}}] \] and \[ CI^{\alpha Bon}=CI_1^{\alpha Bon}\times\cdots \times CI_m^{\alpha Bon} \] where \(\bar{x}_j,n_j,S_j\) keep their usual meaning with respect to group \(j\) (if each region is an interval, then the global region is an \(m\)-dimensional box).
From the above, we can conclude that given a significance level \(\alpha\), the confidence interval for group \(j\) is an ordinary t-interval at level \(1-\alpha/m\), i.e. \(CI_j^{\alpha Bon}=CI_j^{\alpha/m}\).
tb1 = t.test(meanDiffSample1, conf.level = 1 - alpha/4)
CI1 = tb1$conf.int
tb2 = t.test(meanDiffSample2, conf.level = 1 - alpha/4)
CI2 = tb2$conf.int
tb3 = t.test(meanDiffSample3, conf.level = 1 - alpha/4)
CI3 = tb3$conf.int
tb4 = t.test(meanDiffSample4, conf.level = 1 - alpha/4)
CI4 = tb4$conf.int
CI1
## [1] -2.1978808 0.4935951
## attr(,"conf.level")
## [1] 0.9875
CI2
## [1] -5.5157292 0.3457292
## attr(,"conf.level")
## [1] 0.9875
CI3
## [1] 0.1700911 4.1779089
## attr(,"conf.level")
## [1] 0.9875
CI4
## [1] 1.067935 4.917065
## attr(,"conf.level")
## [1] 0.9875
We have generated four \(1-\alpha/4=0.9875\) confidence intervals for the classifiers mean differences.
The first confidence interval is the narrowest, which might indicate a smaller variance; however, we shouldn’t rush to conclusions, since our first group is the biggest (about 2.5 times the size of group 2) and all of the groups are rather small.
Notice that the CIs of groups 1 and 2 contain 0, while the others do not; in fact, the CI for group 4 is quite far from containing 0.
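The containment check can be written out explicitly, using the interval endpoints reported above:

```r
# Simultaneous CI endpoints as printed above (groups 1-4)
CIs <- list(c(-2.1978808, 0.4935951), c(-5.5157292, 0.3457292),
            c(0.1700911, 4.1779089), c(1.067935, 4.917065))
contains_zero <- sapply(CIs, function(ci) ci[1] <= 0 && 0 <= ci[2])
contains_zero  # TRUE TRUE FALSE FALSE
```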
Hence, we can deduce that the auto classifier did better on groups 1 and 2, and that we should recheck it on groups 3 and 4, with more observations.
The auto classifier showed indications of success in the multiple hypotheses testing; at the same time, the results are not good enough, which might be explained by the strict criterion \(\Delta=0\).
The simultaneous confidence intervals indicate that our classifier is headed in the right direction, though it requires more training.
A good way to move forward is to add more observations, both to the existing groups and to new groups.
This way we could continue improving our classifier and our inference would be better (and less variable); in addition, more groups might allow us to use FDR-controlling procedures and FCR-controlled simultaneous confidence intervals.
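If more groups do become available, the Benjamini-Hochberg step-up procedure is already built into `p.adjust`; a sketch on hypothetical future p-values (illustrative only, not our data):

```r
# FDR control via Benjamini-Hochberg on an imagined larger family of tests
p_future <- c(0.001, 0.008, 0.02, 0.04, 0.3, 0.6)
bh_sig <- p.adjust(p_future, "BH") < 0.05
bh_sig  # TRUE TRUE TRUE FALSE FALSE FALSE
```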